import requests
import pandas as pd
import json
import plotly.express as px
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
1) Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)
archive_df = pd.read_csv('twitter-archive-enhanced.csv')
2) Use the Requests library to download the tweet image prediction (image_predictions.tsv)
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open('image-predictions.tsv', mode='wb') as file :
file.write(response.content)
predictions_df = pd.read_csv('image-predictions.tsv', sep='\t')
3) Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)
# Unable to query additional data via the Twitter API
json_data = []
with open('tweet-json.txt', 'r') as file :
for line in file :
json_data.append(json.loads(line))
tweet_df = pd.DataFrame(json_data)
In this section, detect and document at least eight (8) quality issues and two (2) tidiness issue. You must use both visual assessment programmatic assessement to assess the data.
Note : pay attention to the following key points when you access the data.
You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
archive_df.head()
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
| 1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
| 2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None |
| 3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None |
| 4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None |
predictions_df
| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2070 | 891327558926688256 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | 2 | basset | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True |
| 2071 | 891689557279858688 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | 1 | paper_towel | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False |
| 2072 | 891815181378084864 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | 1 | Chihuahua | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True |
| 2073 | 892177421306343426 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True |
| 2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
2075 rows × 12 columns
tweet_df.head()
| created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | in_reply_to_status_id | ... | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status_id_str | quoted_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue Aug 01 16:23:56 +0000 2017 | 892420643555336193 | 892420643555336193 | This is Phineas. He's a mystical boy. Only eve... | False | [0, 85] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892420639486877696, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 39467 | False | False | False | False | en | NaN | NaN | NaN | NaN |
| 1 | Tue Aug 01 00:17:27 +0000 2017 | 892177421306343426 | 892177421306343426 | This is Tilly. She's just checking pup on you.... | False | [0, 138] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892177413194625024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 33819 | False | False | False | False | en | NaN | NaN | NaN | NaN |
| 2 | Mon Jul 31 00:18:03 +0000 2017 | 891815181378084864 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | [0, 121] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891815175371796480, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 25461 | False | False | False | False | en | NaN | NaN | NaN | NaN |
| 3 | Sun Jul 30 15:58:51 +0000 2017 | 891689557279858688 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | [0, 79] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891689552724799489, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 42908 | False | False | False | False | en | NaN | NaN | NaN | NaN |
| 4 | Sat Jul 29 16:00:24 +0000 2017 | 891327558926688256 | 891327558926688256 | This is Franklin. He would like you to stop ca... | False | [0, 138] | {'hashtags': [{'text': 'BarkWeek', 'indices': ... | {'media': [{'id': 891327551943041024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 41048 | False | False | False | False | en | NaN | NaN | NaN | NaN |
5 rows × 31 columns
tweet_df[['in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count']].head()
| in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | user | geo | coordinates | place | contributors | is_quote_status | retweet_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | NaN | None | None | {'id': 4196983835, 'id_str': '4196983835', 'na... | None | None | None | None | False | 8853 |
| 1 | None | NaN | None | None | {'id': 4196983835, 'id_str': '4196983835', 'na... | None | None | None | None | False | 6514 |
| 2 | None | NaN | None | None | {'id': 4196983835, 'id_str': '4196983835', 'na... | None | None | None | None | False | 4328 |
| 3 | None | NaN | None | None | {'id': 4196983835, 'id_str': '4196983835', 'na... | None | None | None | None | False | 8964 |
| 4 | None | NaN | None | None | {'id': 4196983835, 'id_str': '4196983835', 'na... | None | None | None | None | False | 9774 |
user_df = pd.DataFrame(list(tweet_df['user']))
user_df.head(4)
| id | id_str | name | screen_name | location | description | url | entities | protected | followers_count | ... | profile_sidebar_fill_color | profile_text_color | profile_use_background_image | has_extended_profile | default_profile | default_profile_image | following | follow_request_sent | notifications | translator_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4196983835 | 4196983835 | WeRateDogs™ (author) | dog_rates | DM YOUR DOGS, WE WILL RATE | #1 Source for Professional Dog Ratings | STORE... | https://t.co/N7sNNHAEXS | {'url': {'urls': [{'url': 'https://t.co/N7sNNH... | False | 3200889 | ... | 000000 | 000000 | False | True | False | False | True | False | False | none |
| 1 | 4196983835 | 4196983835 | WeRateDogs™ (author) | dog_rates | DM YOUR DOGS, WE WILL RATE | #1 Source for Professional Dog Ratings | STORE... | https://t.co/N7sNNHAEXS | {'url': {'urls': [{'url': 'https://t.co/N7sNNH... | False | 3200889 | ... | 000000 | 000000 | False | True | False | False | True | False | False | none |
| 2 | 4196983835 | 4196983835 | WeRateDogs™ (author) | dog_rates | DM YOUR DOGS, WE WILL RATE | #1 Source for Professional Dog Ratings | STORE... | https://t.co/N7sNNHAEXS | {'url': {'urls': [{'url': 'https://t.co/N7sNNH... | False | 3200889 | ... | 000000 | 000000 | False | True | False | False | True | False | False | none |
| 3 | 4196983835 | 4196983835 | WeRateDogs™ (author) | dog_rates | DM YOUR DOGS, WE WILL RATE | #1 Source for Professional Dog Ratings | STORE... | https://t.co/N7sNNHAEXS | {'url': {'urls': [{'url': 'https://t.co/N7sNNH... | False | 3200889 | ... | 000000 | 000000 | False | True | False | False | True | False | False | none |
4 rows × 42 columns
user_df[['friends_count', 'listed_count', 'created_at', 'favourites_count', 'utc_offset', 'time_zone', 'geo_enabled', 'verified', 'statuses_count', 'lang', 'contributors_enabled', 'is_translator', 'is_translation_enabled', 'profile_background_color', 'profile_background_image_url', 'profile_background_image_url_https', 'profile_background_tile', 'profile_image_url', 'profile_image_url_https', 'profile_banner_url']].head()
| friends_count | listed_count | created_at | favourites_count | utc_offset | time_zone | geo_enabled | verified | statuses_count | lang | contributors_enabled | is_translator | is_translation_enabled | profile_background_color | profile_background_image_url | profile_background_image_url_https | profile_background_tile | profile_image_url | profile_image_url_https | profile_banner_url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 104 | 2784 | Sun Nov 15 21:41:29 +0000 2015 | 114031 | None | None | True | True | 5288 | en | False | False | False | 000000 | http://abs.twimg.com/images/themes/theme1/bg.png | https://abs.twimg.com/images/themes/theme1/bg.png | False | http://pbs.twimg.com/profile_images/8614153285... | https://pbs.twimg.com/profile_images/861415328... | https://pbs.twimg.com/profile_banners/41969838... |
| 1 | 104 | 2784 | Sun Nov 15 21:41:29 +0000 2015 | 114031 | None | None | True | True | 5288 | en | False | False | False | 000000 | http://abs.twimg.com/images/themes/theme1/bg.png | https://abs.twimg.com/images/themes/theme1/bg.png | False | http://pbs.twimg.com/profile_images/8614153285... | https://pbs.twimg.com/profile_images/861415328... | https://pbs.twimg.com/profile_banners/41969838... |
| 2 | 104 | 2784 | Sun Nov 15 21:41:29 +0000 2015 | 114031 | None | None | True | True | 5288 | en | False | False | False | 000000 | http://abs.twimg.com/images/themes/theme1/bg.png | https://abs.twimg.com/images/themes/theme1/bg.png | False | http://pbs.twimg.com/profile_images/8614153285... | https://pbs.twimg.com/profile_images/861415328... | https://pbs.twimg.com/profile_banners/41969838... |
| 3 | 104 | 2784 | Sun Nov 15 21:41:29 +0000 2015 | 114031 | None | None | True | True | 5288 | en | False | False | False | 000000 | http://abs.twimg.com/images/themes/theme1/bg.png | https://abs.twimg.com/images/themes/theme1/bg.png | False | http://pbs.twimg.com/profile_images/8614153285... | https://pbs.twimg.com/profile_images/861415328... | https://pbs.twimg.com/profile_banners/41969838... |
| 4 | 104 | 2784 | Sun Nov 15 21:41:29 +0000 2015 | 114031 | None | None | True | True | 5288 | en | False | False | False | 000000 | http://abs.twimg.com/images/themes/theme1/bg.png | https://abs.twimg.com/images/themes/theme1/bg.png | False | http://pbs.twimg.com/profile_images/8614153285... | https://pbs.twimg.com/profile_images/861415328... | https://pbs.twimg.com/profile_banners/41969838... |
user_df[['profile_sidebar_border_color', 'profile_link_color']]
| profile_sidebar_border_color | profile_link_color | |
|---|---|---|
| 0 | 000000 | F5ABB5 |
| 1 | 000000 | F5ABB5 |
| 2 | 000000 | F5ABB5 |
| 3 | 000000 | F5ABB5 |
| 4 | 000000 | F5ABB5 |
| ... | ... | ... |
| 2349 | 000000 | F5ABB5 |
| 2350 | 000000 | F5ABB5 |
| 2351 | 000000 | F5ABB5 |
| 2352 | 000000 | F5ABB5 |
| 2353 | 000000 | F5ABB5 |
2354 rows × 2 columns
user_df.query('name != "WeRateDogs™ (author)"')
| id | id_str | name | screen_name | location | description | url | entities | protected | followers_count | ... | profile_sidebar_fill_color | profile_text_color | profile_use_background_image | has_extended_profile | default_profile | default_profile_image | following | follow_request_sent | notifications | translator_type |
|---|
0 rows × 42 columns
archive_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2356 entries, 0 to 2355 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2356 non-null int64 1 in_reply_to_status_id 78 non-null float64 2 in_reply_to_user_id 78 non-null float64 3 timestamp 2356 non-null object 4 source 2356 non-null object 5 text 2356 non-null object 6 retweeted_status_id 181 non-null float64 7 retweeted_status_user_id 181 non-null float64 8 retweeted_status_timestamp 181 non-null object 9 expanded_urls 2297 non-null object 10 rating_numerator 2356 non-null int64 11 rating_denominator 2356 non-null int64 12 name 2356 non-null object 13 doggo 2356 non-null object 14 floofer 2356 non-null object 15 pupper 2356 non-null object 16 puppo 2356 non-null object dtypes: float64(4), int64(3), object(10) memory usage: 313.0+ KB
predictions_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2075 entries, 0 to 2074 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2075 non-null int64 1 jpg_url 2075 non-null object 2 img_num 2075 non-null int64 3 p1 2075 non-null object 4 p1_conf 2075 non-null float64 5 p1_dog 2075 non-null bool 6 p2 2075 non-null object 7 p2_conf 2075 non-null float64 8 p2_dog 2075 non-null bool 9 p3 2075 non-null object 10 p3_conf 2075 non-null float64 11 p3_dog 2075 non-null bool dtypes: bool(3), float64(3), int64(2), object(4) memory usage: 152.1+ KB
tweet_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2354 entries, 0 to 2353 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 created_at 2354 non-null object 1 id 2354 non-null int64 2 id_str 2354 non-null object 3 full_text 2354 non-null object 4 truncated 2354 non-null bool 5 display_text_range 2354 non-null object 6 entities 2354 non-null object 7 extended_entities 2073 non-null object 8 source 2354 non-null object 9 in_reply_to_status_id 78 non-null float64 10 in_reply_to_status_id_str 78 non-null object 11 in_reply_to_user_id 78 non-null float64 12 in_reply_to_user_id_str 78 non-null object 13 in_reply_to_screen_name 78 non-null object 14 user 2354 non-null object 15 geo 0 non-null object 16 coordinates 0 non-null object 17 place 1 non-null object 18 contributors 0 non-null object 19 is_quote_status 2354 non-null bool 20 retweet_count 2354 non-null int64 21 favorite_count 2354 non-null int64 22 favorited 2354 non-null bool 23 retweeted 2354 non-null bool 24 possibly_sensitive 2211 non-null object 25 possibly_sensitive_appealable 2211 non-null object 26 lang 2354 non-null object 27 retweeted_status 179 non-null object 28 quoted_status_id 29 non-null float64 29 quoted_status_id_str 29 non-null object 30 quoted_status 28 non-null object dtypes: bool(4), float64(3), int64(3), object(21) memory usage: 505.9+ KB
archive_df.describe()
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | retweeted_status_id | retweeted_status_user_id | rating_numerator | rating_denominator | |
|---|---|---|---|---|---|---|---|
| count | 2.356000e+03 | 7.800000e+01 | 7.800000e+01 | 1.810000e+02 | 1.810000e+02 | 2356.000000 | 2356.000000 |
| mean | 7.427716e+17 | 7.455079e+17 | 2.014171e+16 | 7.720400e+17 | 1.241698e+16 | 13.126486 | 10.455433 |
| std | 6.856705e+16 | 7.582492e+16 | 1.252797e+17 | 6.236928e+16 | 9.599254e+16 | 45.876648 | 6.745237 |
| min | 6.660209e+17 | 6.658147e+17 | 1.185634e+07 | 6.661041e+17 | 7.832140e+05 | 0.000000 | 0.000000 |
| 25% | 6.783989e+17 | 6.757419e+17 | 3.086374e+08 | 7.186315e+17 | 4.196984e+09 | 10.000000 | 10.000000 |
| 50% | 7.196279e+17 | 7.038708e+17 | 4.196984e+09 | 7.804657e+17 | 4.196984e+09 | 11.000000 | 10.000000 |
| 75% | 7.993373e+17 | 8.257804e+17 | 4.196984e+09 | 8.203146e+17 | 4.196984e+09 | 12.000000 | 10.000000 |
| max | 8.924206e+17 | 8.862664e+17 | 8.405479e+17 | 8.874740e+17 | 7.874618e+17 | 1776.000000 | 170.000000 |
predictions_df.describe()
| tweet_id | img_num | p1_conf | p2_conf | p3_conf | |
|---|---|---|---|---|---|
| count | 2.075000e+03 | 2075.000000 | 2075.000000 | 2.075000e+03 | 2.075000e+03 |
| mean | 7.384514e+17 | 1.203855 | 0.594548 | 1.345886e-01 | 6.032417e-02 |
| std | 6.785203e+16 | 0.561875 | 0.271174 | 1.006657e-01 | 5.090593e-02 |
| min | 6.660209e+17 | 1.000000 | 0.044333 | 1.011300e-08 | 1.740170e-10 |
| 25% | 6.764835e+17 | 1.000000 | 0.364412 | 5.388625e-02 | 1.622240e-02 |
| 50% | 7.119988e+17 | 1.000000 | 0.588230 | 1.181810e-01 | 4.944380e-02 |
| 75% | 7.932034e+17 | 1.000000 | 0.843855 | 1.955655e-01 | 9.180755e-02 |
| max | 8.924206e+17 | 4.000000 | 1.000000 | 4.880140e-01 | 2.734190e-01 |
tweet_df.describe()
| id | in_reply_to_status_id | in_reply_to_user_id | retweet_count | favorite_count | quoted_status_id | |
|---|---|---|---|---|---|---|
| count | 2.354000e+03 | 7.800000e+01 | 7.800000e+01 | 2354.000000 | 2354.000000 | 2.900000e+01 |
| mean | 7.426978e+17 | 7.455079e+17 | 2.014171e+16 | 3164.797366 | 8080.968564 | 8.162686e+17 |
| std | 6.852812e+16 | 7.582492e+16 | 1.252797e+17 | 5284.770364 | 11814.771334 | 6.164161e+16 |
| min | 6.660209e+17 | 6.658147e+17 | 1.185634e+07 | 0.000000 | 0.000000 | 6.721083e+17 |
| 25% | 6.783975e+17 | 6.757419e+17 | 3.086374e+08 | 624.500000 | 1415.000000 | 7.888183e+17 |
| 50% | 7.194596e+17 | 7.038708e+17 | 4.196984e+09 | 1473.500000 | 3603.500000 | 8.340867e+17 |
| 75% | 7.993058e+17 | 8.257804e+17 | 4.196984e+09 | 3652.000000 | 10122.250000 | 8.664587e+17 |
| max | 8.924206e+17 | 8.862664e+17 | 8.405479e+17 | 79515.000000 | 132810.000000 | 8.860534e+17 |
all_columns = pd.Series(list(archive_df) + list(predictions_df) + list(tweet_df))
all_columns[all_columns.duplicated()]
17 tweet_id 37 source 38 in_reply_to_status_id 40 in_reply_to_user_id dtype: object
archive_df.duplicated().sum()
0
archive_df.isnull().sum()
tweet_id 0 in_reply_to_status_id 2278 in_reply_to_user_id 2278 timestamp 0 source 0 text 0 retweeted_status_id 2175 retweeted_status_user_id 2175 retweeted_status_timestamp 2175 expanded_urls 59 rating_numerator 0 rating_denominator 0 name 0 doggo 0 floofer 0 pupper 0 puppo 0 dtype: int64
archive_df['rating_denominator'].value_counts()
10 2333 11 3 50 3 80 2 20 2 2 1 16 1 40 1 70 1 15 1 90 1 110 1 120 1 130 1 150 1 170 1 7 1 0 1 Name: rating_denominator, dtype: int64
archive_df[archive_df['in_reply_to_status_id'].isnull()]
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
| 1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
| 2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None |
| 3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None |
| 4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2351 | 666049248165822465 | NaN | NaN | 2015-11-16 00:24:50 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a 1949 1st generation vulpix. Enj... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666049248... | 5 | 10 | None | None | None | None | None |
| 2352 | 666044226329800704 | NaN | NaN | 2015-11-16 00:04:52 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a purebred Piers Morgan. Loves to Netf... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666044226... | 6 | 10 | a | None | None | None | None |
| 2353 | 666033412701032449 | NaN | NaN | 2015-11-15 23:21:54 +0000 | <a href="http://twitter.com/download/iphone" r... | Here is a very happy pup. Big fan of well-main... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666033412... | 9 | 10 | a | None | None | None | None |
| 2354 | 666029285002620928 | NaN | NaN | 2015-11-15 23:05:30 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a western brown Mitsubishi terrier. Up... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666029285... | 7 | 10 | a | None | None | None | None |
| 2355 | 666020888022790149 | NaN | NaN | 2015-11-15 22:32:08 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a Japanese Irish Setter. Lost eye... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666020888... | 8 | 10 | None | None | None | None | None |
2278 rows × 17 columns
predictions_df.isnull().sum()
tweet_id 0 jpg_url 0 img_num 0 p1 0 p1_conf 0 p1_dog 0 p2 0 p2_conf 0 p2_dog 0 p3 0 p3_conf 0 p3_dog 0 dtype: int64
predictions_df.duplicated().sum()
0
predictions_df['img_num'].value_counts()
1 1780 2 198 3 66 4 31 Name: img_num, dtype: int64
predictions_df.query('p1_dog == False').query('p2_dog == False').query('p3_dog == False')
| tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 666051853826850816 | https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg | 1 | box_turtle | 0.933012 | False | mud_turtle | 0.045885 | False | terrapin | 0.017885 | False |
| 17 | 666104133288665088 | https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg | 1 | hen | 0.965932 | False | cock | 0.033919 | False | partridge | 0.000052 | False |
| 18 | 666268910803644416 | https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg | 1 | desktop_computer | 0.086502 | False | desk | 0.085547 | False | bookcase | 0.079480 | False |
| 21 | 666293911632134144 | https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg | 1 | three-toed_sloth | 0.914671 | False | otter | 0.015250 | False | great_grey_owl | 0.013207 | False |
| 25 | 666362758909284353 | https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg | 1 | guinea_pig | 0.996496 | False | skunk | 0.002402 | False | hamster | 0.000461 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021 | 880935762899988482 | https://pbs.twimg.com/media/DDm2Z5aXUAEDS2u.jpg | 1 | street_sign | 0.251801 | False | umbrella | 0.115123 | False | traffic_light | 0.069534 | False |
| 2022 | 881268444196462592 | https://pbs.twimg.com/media/DDrk-f9WAAI-WQv.jpg | 1 | tusker | 0.473303 | False | Indian_elephant | 0.245646 | False | ibex | 0.055661 | False |
| 2046 | 886680336477933568 | https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg | 1 | convertible | 0.738995 | False | sports_car | 0.139952 | False | car_wheel | 0.044173 | False |
| 2052 | 887517139158093824 | https://pbs.twimg.com/ext_tw_video_thumb/88751... | 1 | limousine | 0.130432 | False | tow_truck | 0.029175 | False | shopping_cart | 0.026321 | False |
| 2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
324 rows × 12 columns
tweet_df['id'].duplicated().sum()
0
tweet_df.isnull().sum()
created_at 0 id 0 id_str 0 full_text 0 truncated 0 display_text_range 0 entities 0 extended_entities 281 source 0 in_reply_to_status_id 2276 in_reply_to_status_id_str 2276 in_reply_to_user_id 2276 in_reply_to_user_id_str 2276 in_reply_to_screen_name 2276 user 0 geo 2354 coordinates 2354 place 2353 contributors 2354 is_quote_status 0 retweet_count 0 favorite_count 0 favorited 0 retweeted 0 possibly_sensitive 143 possibly_sensitive_appealable 143 lang 0 retweeted_status 2175 quoted_status_id 2325 quoted_status_id_str 2325 quoted_status 2326 dtype: int64
tweet_df.query('retweeted == True')
| created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | in_reply_to_status_id | ... | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status_id_str | quoted_status |
|---|
0 rows × 31 columns
archive_df dataset :¶1) Some values of columns are non-assigned numbers.
2) The values of columns of ids (tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id,retweeted_status_user_id and quoted_status_id) are floats or integers not strings.
3) The values of column source have link's tags.
4) Some of values of column name are in lowercase.
5) The colmuns timestamp and retweeted_status_timestamp are not in "datetime" format.
predictions_df dataset :¶6) The values of column tweet_id are integers not strings.
7) Some of values of columns p1, p2 et p3 are in lowercase.
tweet_df dataset :¶8) Some values of columns are non-assigned numbers.
9) The values of columns of ids (id, in_reply_to_status_id and in_reply_to_user_id) are floats or integers not strings.
10) The columns id_str, in_reply_to_status_id_str, in_reply_to_user_id_str and quoted_status_id_str are duplicates of the columns id, in_reply_to_status_id, in_reply_to_user_id and quoted_status_id.
11) The values of column display_text_range are intervalles.
12) The colomns geo, coordinates and contributors are empty.
13) The values of column lang are in lowercase.
14) The values of column source have link's tags.
15) The columns id and full_text should be renamed tweet_id and text respectively.
16) The columns entities, exrtended_entities and user should be removed.
In this section, clean all of the issues you documented while assessing.
Note : Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of tidy data. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
# Make copies of original pieces of data
archive_clean = archive_df.copy()
predictions_clean = predictions_df.copy()
tweet_clean = tweet_df.copy()
# Removing retweets
archive_clean = archive_clean[archive_clean['in_reply_to_status_id'].isnull()]
tweet_clean = tweet_clean[tweet_clean['in_reply_to_status_id'].isnull()]
# Removing rating not about dogs
archive_clean = archive_clean.query('rating_denominator == 10')
# Removing tweets without dogs
predictions_clean = predictions_clean.query('p1_dog == True' or 'p2_dog == True' or 'p3_dog == True')
archive_clean.fillna('None', inplace=True)
Test
archive_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2260 entries, 0 to 2355 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2260 non-null int64 1 in_reply_to_status_id 2260 non-null object 2 in_reply_to_user_id 2260 non-null object 3 timestamp 2260 non-null object 4 source 2260 non-null object 5 text 2260 non-null object 6 retweeted_status_id 2260 non-null object 7 retweeted_status_user_id 2260 non-null object 8 retweeted_status_timestamp 2260 non-null object 9 expanded_urls 2260 non-null object 10 rating_numerator 2260 non-null int64 11 rating_denominator 2260 non-null int64 12 name 2260 non-null object 13 doggo 2260 non-null object 14 floofer 2260 non-null object 15 pupper 2260 non-null object 16 puppo 2260 non-null object dtypes: int64(3), object(14) memory usage: 317.8+ KB
archive_df dataset, are floats or integers not strings.¶Define :
IDs are better as strings, so we will redefine their type as object.
Code
archive_clean[['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id']] = archive_clean[['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id']].astype(object)
Test
archive_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2260 entries, 0 to 2355 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 2260 non-null object 1 in_reply_to_status_id 2260 non-null object 2 in_reply_to_user_id 2260 non-null object 3 timestamp 2260 non-null object 4 source 2260 non-null object 5 text 2260 non-null object 6 retweeted_status_id 2260 non-null object 7 retweeted_status_user_id 2260 non-null object 8 retweeted_status_timestamp 2260 non-null object 9 expanded_urls 2260 non-null object 10 rating_numerator 2260 non-null int64 11 rating_denominator 2260 non-null int64 12 name 2260 non-null object 13 doggo 2260 non-null object 14 floofer 2260 non-null object 15 pupper 2260 non-null object 16 puppo 2260 non-null object dtypes: int64(2), object(15) memory usage: 317.8+ KB
archive_df dataset, have link's tags.¶Define :
Those values are linked to their download folder. Then we must remove the link's tags.
Code
source_list = list(archive_clean['source'])
sources = []
for line in source_list :
sources.append(line.split('>', 1)[1].split('<')[0])
archive_clean['source'] = pd.Series(sources)
Test
archive_clean['source'].value_counts()
Twitter for iPhone 2037 Vine - Make a Scene 86 Twitter Web Client 32 TweetDeck 11 Name: source, dtype: int64
archive_df dataset, are in lowercase.¶Define :
Names start with capital letters, so we will redefine their strings as titles.
Code
archive_clean['name'] = archive_clean['name'].str.title()
Test
archive_clean['name'].value_counts()
None 656
A 54
Charlie 12
Lucy 11
Cooper 11
...
Caryl 1
Billl 1
Tebow 1
Gerbald 1
Brandy 1
Name: name, Length: 953, dtype: int64
archive_df dataset, are not in "datetime" format.¶Define :
We will redefine their type as 'datetime'.
Code
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])
archive_clean['retweeted_status_timestamp'] = pd.to_datetime(archive_clean['retweeted_status_timestamp'].replace("None", "NaN"), utc=True)
Test
archive_clean.head(4)
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | None | None | 2017-08-01 16:23:56+00:00 | Twitter for iPhone | This is Phineas. He's a mystical boy. Only eve... | None | None | NaT | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
| 1 | 892177421306343426 | None | None | 2017-08-01 00:17:27+00:00 | Twitter for iPhone | This is Tilly. She's just checking pup on you.... | None | None | NaT | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
| 2 | 891815181378084864 | None | None | 2017-07-31 00:18:03+00:00 | Twitter for iPhone | This is Archie. He is a rare Norwegian Pouncin... | None | None | NaT | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None |
| 3 | 891689557279858688 | None | None | 2017-07-30 15:58:51+00:00 | Twitter for iPhone | This is Darla. She commenced a snooze mid meal... | None | None | NaT | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None |
predictions_df dataset, are integers not strings.¶Define :
One more time, ids are better as strings. We will redefine their type as object.
Code
predictions_clean['tweet_id'] = predictions_clean['tweet_id'].astype(object)
Test
predictions_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1532 entries, 0 to 2073 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 1532 non-null object 1 jpg_url 1532 non-null object 2 img_num 1532 non-null int64 3 p1 1532 non-null object 4 p1_conf 1532 non-null float64 5 p1_dog 1532 non-null bool 6 p2 1532 non-null object 7 p2_conf 1532 non-null float64 8 p2_dog 1532 non-null bool 9 p3 1532 non-null object 10 p3_conf 1532 non-null float64 11 p3_dog 1532 non-null bool dtypes: bool(3), float64(3), int64(1), object(5) memory usage: 124.2+ KB
predictions_df dataset, are in lowercase.¶Define :
They are names, so we will redefine their strings as titles.
Code
p_list = ['p1', 'p2', 'p3']
for p in p_list :
predictions_clean[p] = predictions_clean[p].str.title()
Test
predictions_clean[['p1', 'p2', 'p3']]
| p1 | p2 | p3 | |
|---|---|---|---|
| 0 | Welsh_Springer_Spaniel | Collie | Shetland_Sheepdog |
| 1 | Redbone | Miniature_Pinscher | Rhodesian_Ridgeback |
| 2 | German_Shepherd | Malinois | Bloodhound |
| 3 | Rhodesian_Ridgeback | Redbone | Miniature_Pinscher |
| 4 | Miniature_Pinscher | Rottweiler | Doberman |
| ... | ... | ... | ... |
| 2068 | Appenzeller | Border_Collie | Ice_Lolly |
| 2069 | Chesapeake_Bay_Retriever | Irish_Terrier | Indian_Elephant |
| 2070 | Basset | English_Springer | German_Short-Haired_Pointer |
| 2072 | Chihuahua | Malamute | Kelpie |
| 2073 | Chihuahua | Pekinese | Papillon |
1532 rows × 3 columns
tweet_df dataset, are non-assigned numbers.¶Define :
We will replace all of them with the empty object ('None').
Code
tweet_clean.fillna('None', inplace=True)
Test
tweet_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2276 entries, 0 to 2353 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 created_at 2276 non-null object 1 id 2276 non-null int64 2 id_str 2276 non-null object 3 full_text 2276 non-null object 4 truncated 2276 non-null bool 5 display_text_range 2276 non-null object 6 entities 2276 non-null object 7 extended_entities 2276 non-null object 8 source 2276 non-null object 9 in_reply_to_status_id 2276 non-null object 10 in_reply_to_status_id_str 2276 non-null object 11 in_reply_to_user_id 2276 non-null object 12 in_reply_to_user_id_str 2276 non-null object 13 in_reply_to_screen_name 2276 non-null object 14 user 2276 non-null object 15 geo 2276 non-null object 16 coordinates 2276 non-null object 17 place 2276 non-null object 18 contributors 2276 non-null object 19 is_quote_status 2276 non-null bool 20 retweet_count 2276 non-null int64 21 favorite_count 2276 non-null int64 22 favorited 2276 non-null bool 23 retweeted 2276 non-null bool 24 possibly_sensitive 2276 non-null object 25 possibly_sensitive_appealable 2276 non-null object 26 lang 2276 non-null object 27 retweeted_status 2276 non-null object 28 quoted_status_id 2276 non-null object 29 quoted_status_id_str 2276 non-null object 30 quoted_status 2276 non-null object dtypes: bool(4), int64(3), object(24) memory usage: 506.8+ KB
tweet_df dataset, are floats or integers not strings.¶Define :
We will redefine their type as object.
Code
tweet_clean[['id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'quoted_status_id']] = tweet_clean[['id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'quoted_status_id']].astype(object)
Test
tweet_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2276 entries, 0 to 2353 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 created_at 2276 non-null object 1 id 2276 non-null object 2 id_str 2276 non-null object 3 full_text 2276 non-null object 4 truncated 2276 non-null bool 5 display_text_range 2276 non-null object 6 entities 2276 non-null object 7 extended_entities 2276 non-null object 8 source 2276 non-null object 9 in_reply_to_status_id 2276 non-null object 10 in_reply_to_status_id_str 2276 non-null object 11 in_reply_to_user_id 2276 non-null object 12 in_reply_to_user_id_str 2276 non-null object 13 in_reply_to_screen_name 2276 non-null object 14 user 2276 non-null object 15 geo 2276 non-null object 16 coordinates 2276 non-null object 17 place 2276 non-null object 18 contributors 2276 non-null object 19 is_quote_status 2276 non-null bool 20 retweet_count 2276 non-null int64 21 favorite_count 2276 non-null int64 22 favorited 2276 non-null bool 23 retweeted 2276 non-null bool 24 possibly_sensitive 2276 non-null object 25 possibly_sensitive_appealable 2276 non-null object 26 lang 2276 non-null object 27 retweeted_status 2276 non-null object 28 quoted_status_id 2276 non-null object 29 quoted_status_id_str 2276 non-null object 30 quoted_status 2276 non-null object dtypes: bool(4), int64(2), object(25) memory usage: 506.8+ KB
tweet_df dataset, are duplicates of the columns id, in_reply_to_status_id and in_reply_to_user_id.¶Define :
We must remove them and keep the originals.
Code
tweet_clean.drop(['id_str', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'quoted_status_id_str'], axis=1, inplace=True)
Test
tweet_clean.head(4)
| created_at | id | full_text | truncated | display_text_range | entities | extended_entities | source | in_reply_to_status_id | in_reply_to_user_id | ... | retweet_count | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue Aug 01 16:23:56 +0000 2017 | 892420643555336193 | This is Phineas. He's a mystical boy. Only eve... | False | [0, 85] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892420639486877696, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 8853 | 39467 | False | False | False | False | en | None | None | None |
| 1 | Tue Aug 01 00:17:27 +0000 2017 | 892177421306343426 | This is Tilly. She's just checking pup on you.... | False | [0, 138] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892177413194625024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 6514 | 33819 | False | False | False | False | en | None | None | None |
| 2 | Mon Jul 31 00:18:03 +0000 2017 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | [0, 121] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891815175371796480, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 4328 | 25461 | False | False | False | False | en | None | None | None |
| 3 | Sun Jul 30 15:58:51 +0000 2017 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | [0, 79] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891689552724799489, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 8964 | 42908 | False | False | False | False | en | None | None | None |
4 rows × 27 columns
tweet_df dataset, are intervalles.¶Define :
Code
size_list = list(tweet_clean['display_text_range'])
text_sizes = []
for line in size_list :
text_sizes.append(line[1])
tweet_clean.insert(4, 'text_size', pd.Series(text_sizes))
tweet_clean.drop('display_text_range', axis=1, inplace=True)
Test
tweet_clean.head(4)
| created_at | id | full_text | truncated | text_size | entities | extended_entities | source | in_reply_to_status_id | in_reply_to_user_id | ... | retweet_count | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue Aug 01 16:23:56 +0000 2017 | 892420643555336193 | This is Phineas. He's a mystical boy. Only eve... | False | 85.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892420639486877696, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 8853 | 39467 | False | False | False | False | en | None | None | None |
| 1 | Tue Aug 01 00:17:27 +0000 2017 | 892177421306343426 | This is Tilly. She's just checking pup on you.... | False | 138.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892177413194625024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 6514 | 33819 | False | False | False | False | en | None | None | None |
| 2 | Mon Jul 31 00:18:03 +0000 2017 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | 121.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891815175371796480, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 4328 | 25461 | False | False | False | False | en | None | None | None |
| 3 | Sun Jul 30 15:58:51 +0000 2017 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | 79.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891689552724799489, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 8964 | 42908 | False | False | False | False | en | None | None | None |
4 rows × 27 columns
tweet_df dataset, are empty.¶Define :
Since they don't have any data, we must delete those columns.
Code
tweet_clean.drop(['geo', 'coordinates', 'contributors'], axis=1, inplace=True)
Test
tweet_clean.head(4)
| created_at | id | full_text | truncated | text_size | entities | extended_entities | source | in_reply_to_status_id | in_reply_to_user_id | ... | retweet_count | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue Aug 01 16:23:56 +0000 2017 | 892420643555336193 | This is Phineas. He's a mystical boy. Only eve... | False | 85.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892420639486877696, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 8853 | 39467 | False | False | False | False | en | None | None | None |
| 1 | Tue Aug 01 00:17:27 +0000 2017 | 892177421306343426 | This is Tilly. She's just checking pup on you.... | False | 138.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892177413194625024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 6514 | 33819 | False | False | False | False | en | None | None | None |
| 2 | Mon Jul 31 00:18:03 +0000 2017 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | 121.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891815175371796480, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 4328 | 25461 | False | False | False | False | en | None | None | None |
| 3 | Sun Jul 30 15:58:51 +0000 2017 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | 79.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891689552724799489, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | None | None | ... | 8964 | 42908 | False | False | False | False | en | None | None | None |
4 rows × 24 columns
tweet_df dataset, are in lowercase.¶Define :
We will redefine their strings as titles.
Code
tweet_clean['lang'] = tweet_clean['lang'].str.title()
Test
tweet_clean['lang'].value_counts()
En 2265 In 3 Nl 3 Und 2 Ro 1 Eu 1 Et 1 Name: lang, dtype: int64
tweet_df dataset, have link's tags.¶Define :
One more time, those values are linked to their download folder. Then we must remove the link's tags.
Code
source_list = list(tweet_clean['source'])
sources = []
for line in source_list :
sources.append(line.split('>', 1)[1].split('<')[0])
tweet_clean['source'] = pd.Series(sources)
Test
tweet_clean['source'].value_counts()
Twitter for iPhone 2068 Vine - Make a Scene 88 Twitter Web Client 32 TweetDeck 11 Name: source, dtype: int64
tweet_df dataset, should be renamed tweet_id and text respectively.¶Define :
There are standard names for all the twitter dataset. So id and full_text should be tweet_id and text respectively.
Code
tweet_clean.rename(columns = {'id' : 'tweet_id', 'full_text' : 'text'}, inplace=True)
Test
tweet_clean.head(4)
| created_at | tweet_id | text | truncated | text_size | entities | extended_entities | source | in_reply_to_status_id | in_reply_to_user_id | ... | retweet_count | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue Aug 01 16:23:56 +0000 2017 | 892420643555336193 | This is Phineas. He's a mystical boy. Only eve... | False | 85.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892420639486877696, 'id_str'... | Twitter for iPhone | None | None | ... | 8853 | 39467 | False | False | False | False | En | None | None | None |
| 1 | Tue Aug 01 00:17:27 +0000 2017 | 892177421306343426 | This is Tilly. She's just checking pup on you.... | False | 138.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892177413194625024, 'id_str'... | Twitter for iPhone | None | None | ... | 6514 | 33819 | False | False | False | False | En | None | None | None |
| 2 | Mon Jul 31 00:18:03 +0000 2017 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | 121.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891815175371796480, 'id_str'... | Twitter for iPhone | None | None | ... | 4328 | 25461 | False | False | False | False | En | None | None | None |
| 3 | Sun Jul 30 15:58:51 +0000 2017 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | 79.0 | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891689552724799489, 'id_str'... | Twitter for iPhone | None | None | ... | 8964 | 42908 | False | False | False | False | En | None | None | None |
4 rows × 24 columns
tweet_df dataset, should be removed.¶Define :
Since they are very dirty, we must delete those columns.
Code
tweet_clean.drop(['entities', 'extended_entities', 'user'], axis=1, inplace=True)
Test
tweet_clean.head(4)
| created_at | tweet_id | text | truncated | text_size | source | in_reply_to_status_id | in_reply_to_user_id | in_reply_to_screen_name | place | ... | retweet_count | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue Aug 01 16:23:56 +0000 2017 | 892420643555336193 | This is Phineas. He's a mystical boy. Only eve... | False | 85.0 | Twitter for iPhone | None | None | None | None | ... | 8853 | 39467 | False | False | False | False | En | None | None | None |
| 1 | Tue Aug 01 00:17:27 +0000 2017 | 892177421306343426 | This is Tilly. She's just checking pup on you.... | False | 138.0 | Twitter for iPhone | None | None | None | None | ... | 6514 | 33819 | False | False | False | False | En | None | None | None |
| 2 | Mon Jul 31 00:18:03 +0000 2017 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | 121.0 | Twitter for iPhone | None | None | None | None | ... | 4328 | 25461 | False | False | False | False | En | None | None | None |
| 3 | Sun Jul 30 15:58:51 +0000 2017 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | 79.0 | Twitter for iPhone | None | None | None | None | ... | 8964 | 42908 | False | False | False | False | En | None | None | None |
4 rows × 21 columns
tweet_df dataset, should be split into creation_date and creation_time columns.¶Define :
Code
creation_list = list(tweet_clean['created_at'])
creation = [], []
for line in creation_list :
creation[0].append(line.split(' ')[0] + ' ' + line.split(' ')[1] + ' ' + line.split(' ')[2] + ' ' + line.split(' ')[5])
creation[1].append(line.split(' ')[3])
tweet_clean.insert(0, 'creation_date', pd.to_datetime(pd.Series(creation[0])))
tweet_clean.insert(1, 'creation_time', pd.Series(creation[1]))
# Removing tweets beyond Aug 1st, 2017
tweet_clean = tweet_clean.query('creation_date <= "2017-08-01"')
tweet_clean.drop('created_at', axis=1, inplace=True)
Test
tweet_clean.head(4)
| creation_date | creation_time | tweet_id | text | truncated | text_size | source | in_reply_to_status_id | in_reply_to_user_id | in_reply_to_screen_name | ... | retweet_count | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-01 | 16:23:56 | 892420643555336193 | This is Phineas. He's a mystical boy. Only eve... | False | 85.0 | Twitter for iPhone | None | None | None | ... | 8853 | 39467 | False | False | False | False | En | None | None | None |
| 1 | 2017-08-01 | 00:17:27 | 892177421306343426 | This is Tilly. She's just checking pup on you.... | False | 138.0 | Twitter for iPhone | None | None | None | ... | 6514 | 33819 | False | False | False | False | En | None | None | None |
| 2 | 2017-07-31 | 00:18:03 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | 121.0 | Twitter for iPhone | None | None | None | ... | 4328 | 25461 | False | False | False | False | En | None | None | None |
| 3 | 2017-07-30 | 15:58:51 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | 79.0 | Twitter for iPhone | None | None | None | ... | 8964 | 42908 | False | False | False | False | En | None | None | None |
4 rows × 22 columns
twitter_archive_master dataset.¶Define :
According to the Storing Data section, we need one master dataset from the three datasets archive_clean, predictions_clean and tweet_clean.
Code
twitter_archive = archive_clean.merge(tweet_clean, how='inner')
twitter_archive_master = twitter_archive.merge(predictions_clean, how='inner')
Test 1
twitter_archive_master.head(4)
| tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | ... | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892177421306343426 | None | None | 2017-08-01 00:17:27+00:00 | Twitter for iPhone | This is Tilly. She's just checking pup on you.... | None | None | NaT | https://twitter.com/dog_rates/status/892177421... | ... | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | Papillon | 0.068957 | True |
| 1 | 891815181378084864 | None | None | 2017-07-31 00:18:03+00:00 | Twitter for iPhone | This is Archie. He is a rare Norwegian Pouncin... | None | None | NaT | https://twitter.com/dog_rates/status/891815181... | ... | 1 | Chihuahua | 0.716012 | True | Malamute | 0.078253 | True | Kelpie | 0.031379 | True |
| 2 | 891327558926688256 | None | None | 2017-07-29 16:00:24+00:00 | Twitter for iPhone | This is Franklin. He would like you to stop ca... | None | None | NaT | https://twitter.com/dog_rates/status/891327558... | ... | 2 | Basset | 0.555712 | True | English_Springer | 0.225770 | True | German_Short-Haired_Pointer | 0.175219 | True |
| 3 | 891087950875897856 | None | None | 2017-07-29 00:08:17+00:00 | Twitter for iPhone | Here we have a majestic great white breaching ... | None | None | NaT | https://twitter.com/dog_rates/status/891087950... | ... | 1 | Chesapeake_Bay_Retriever | 0.425595 | True | Irish_Terrier | 0.116317 | True | Indian_Elephant | 0.076902 | False |
4 rows × 45 columns
twitter_archive_master[['rating_numerator', 'rating_denominator', 'name', 'doggo','floofer', 'pupper', 'puppo', 'creation_date', 'creation_time', 'truncated', 'text_size', 'place', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited']]
| rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | creation_date | creation_time | truncated | text_size | place | is_quote_status | retweet_count | favorite_count | favorited | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 10 | Tilly | None | None | None | None | 2017-08-01 | 00:17:27 | False | 138.0 | None | False | 6514 | 33819 | False |
| 1 | 12 | 10 | Archie | None | None | None | None | 2017-07-31 | 00:18:03 | False | 121.0 | None | False | 4328 | 25461 | False |
| 2 | 12 | 10 | Franklin | None | None | None | None | 2017-07-29 | 16:00:24 | False | 138.0 | None | False | 9774 | 41048 | False |
| 3 | 13 | 10 | None | None | None | None | None | 2017-07-29 | 00:08:17 | False | 138.0 | None | False | 3261 | 20562 | False |
| 4 | 13 | 10 | Jax | None | None | None | None | 2017-07-28 | 16:27:12 | False | 140.0 | None | False | 2158 | 12041 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1287 | 12 | 10 | None | None | None | None | None | 2015-11-16 | 23:23:41 | False | 137.0 | None | False | 68 | 303 | False |
| 1288 | 12 | 10 | None | None | None | None | None | 2015-11-16 | 21:10:36 | False | 140.0 | None | False | 101 | 346 | False |
| 1289 | 8 | 10 | None | None | None | None | None | 2015-11-16 | 20:32:58 | False | 135.0 | None | False | 358 | 736 | False |
| 1290 | 8 | 10 | A | None | None | None | None | 2015-11-16 | 19:31:45 | False | 139.0 | None | False | 61 | 243 | False |
| 1291 | 11 | 10 | Olive | None | None | None | None | 2015-11-16 | 16:11:11 | False | 136.0 | None | False | 162 | 398 | False |
1292 rows × 16 columns
twitter_archive_master[['retweeted', 'possibly_sensitive', 'possibly_sensitive_appealable', 'lang', 'retweeted_status', 'quoted_status_id', 'quoted_status']]
| retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status | |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | En | None | None | None |
| 1 | False | False | False | En | None | None | None |
| 2 | False | False | False | En | None | None | None |
| 3 | False | False | False | En | None | None | None |
| 4 | False | False | False | En | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1287 | False | False | False | En | None | None | None |
| 1288 | False | False | False | En | None | None | None |
| 1289 | False | False | False | En | None | None | None |
| 1290 | False | False | False | En | None | None | None |
| 1291 | False | False | False | En | None | None | None |
1292 rows × 7 columns
Test 2 :
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1292 entries, 0 to 1291 Data columns (total 45 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 1292 non-null object 1 in_reply_to_status_id 1292 non-null object 2 in_reply_to_user_id 1292 non-null object 3 timestamp 1292 non-null datetime64[ns, UTC] 4 source 1292 non-null object 5 text 1292 non-null object 6 retweeted_status_id 1292 non-null object 7 retweeted_status_user_id 1292 non-null object 8 retweeted_status_timestamp 50 non-null datetime64[ns, UTC] 9 expanded_urls 1292 non-null object 10 rating_numerator 1292 non-null int64 11 rating_denominator 1292 non-null int64 12 name 1292 non-null object 13 doggo 1292 non-null object 14 floofer 1292 non-null object 15 pupper 1292 non-null object 16 puppo 1292 non-null object 17 creation_date 1292 non-null datetime64[ns] 18 creation_time 1292 non-null object 19 truncated 1292 non-null bool 20 text_size 1292 non-null float64 21 in_reply_to_screen_name 1292 non-null object 22 place 1292 non-null object 23 is_quote_status 1292 non-null bool 24 retweet_count 1292 non-null int64 25 favorite_count 1292 non-null int64 26 favorited 1292 non-null bool 27 retweeted 1292 non-null bool 28 possibly_sensitive 1292 non-null object 29 possibly_sensitive_appealable 1292 non-null object 30 lang 1292 non-null object 31 retweeted_status 1292 non-null object 32 quoted_status_id 1292 non-null object 33 quoted_status 1292 non-null object 34 jpg_url 1292 non-null object 35 img_num 1292 non-null int64 36 p1 1292 non-null object 37 p1_conf 1292 non-null float64 38 p1_dog 1292 non-null bool 39 p2 1292 non-null object 40 p2_conf 1292 non-null float64 41 p2_dog 1292 non-null bool 42 p3 1292 non-null object 43 p3_conf 1292 non-null float64 44 p3_dog 1292 non-null bool dtypes: bool(7), datetime64[ns, UTC](2), datetime64[ns](1), float64(4), int64(5), object(26) memory usage: 402.5+ KB
Test 3 :
twitter_archive_master.describe()
| rating_numerator | rating_denominator | text_size | retweet_count | favorite_count | img_num | p1_conf | p2_conf | p3_conf | |
|---|---|---|---|---|---|---|---|---|---|
| count | 1292.000000 | 1292.0 | 1292.000000 | 1292.000000 | 1292.000000 | 1292.000000 | 1292.000000 | 1292.000000 | 1.292000e+03 |
| mean | 11.164861 | 10.0 | 111.829721 | 3235.156347 | 9817.326625 | 1.229876 | 0.615054 | 0.141318 | 6.167530e-02 |
| std | 2.480581 | 0.0 | 26.275348 | 5218.044040 | 12860.737037 | 0.596093 | 0.259599 | 0.102442 | 5.262065e-02 |
| min | 2.000000 | 10.0 | 13.000000 | 45.000000 | 0.000000 | 1.000000 | 0.055379 | 0.000023 | 2.160900e-07 |
| 25% | 10.000000 | 10.0 | 94.000000 | 776.000000 | 2265.000000 | 1.000000 | 0.392428 | 0.057067 | 1.496200e-02 |
| 50% | 11.000000 | 10.0 | 116.000000 | 1736.000000 | 4958.500000 | 1.000000 | 0.614697 | 0.128910 | 4.952715e-02 |
| 75% | 12.000000 | 10.0 | 137.000000 | 3762.000000 | 13003.750000 | 1.000000 | 0.850830 | 0.202873 | 9.613192e-02 |
| max | 75.000000 | 10.0 | 148.000000 | 79515.000000 | 132810.000000 | 4.000000 | 0.999956 | 0.467678 | 2.710420e-01 |
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
twitter_archive_master.to_csv('twitter_archive_master.csv', index=False)
In this section, analyze and visualize your wrangled data. You must produce at least three (3) insights and one (1) visualization.
# Displaying dogs with dog ratings, their count as favorite, the text size and the images number of the tweet
twitter_archive_master[['name', 'rating_numerator', 'favorite_count', 'text_size', 'img_num']]
| name | rating_numerator | favorite_count | text_size | img_num | |
|---|---|---|---|---|---|
| 0 | Tilly | 13 | 33819 | 138.0 | 1 |
| 1 | Archie | 12 | 25461 | 121.0 | 1 |
| 2 | Franklin | 12 | 41048 | 138.0 | 2 |
| 3 | None | 13 | 20562 | 138.0 | 1 |
| 4 | Jax | 13 | 12041 | 140.0 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1287 | None | 12 | 303 | 137.0 | 1 |
| 1288 | None | 12 | 346 | 140.0 | 1 |
| 1289 | None | 8 | 736 | 135.0 | 1 |
| 1290 | A | 8 | 243 | 139.0 | 1 |
| 1291 | Olive | 11 | 398 | 136.0 | 1 |
1292 rows × 5 columns
The count of the number of times the tweet is favorited (or not) according to the images number in the tweet and the booleans retweeted and favorited.
twitter_archive_master.groupby(['favorited', 'img_num']).favorite_count.value_counts()
favorited img_num favorite_count
False 1 0 46
1691 3
265 2
345 2
693 2
..
4 33911 1
40325 1
True 1 2545 1
28150 1
3 24167 1
Name: favorite_count, Length: 1207, dtype: int64
The count of the dogs ratings according to their names and the images number in the tweet.
twitter_archive_master.groupby(['img_num', 'name']).rating_numerator.value_counts()
img_num name rating_numerator
1 A 11 8
9 4
10 4
7 2
8 2
..
4 Odie 10 1
Oliver 11 1
Paisley 13 1
Samson 13 1
Trooper 13 1
Name: rating_numerator, Length: 912, dtype: int64
The mean of text size according to the tweet source and the user language.
twitter_archive_master.groupby(['lang', 'source']).text_size.mean()
lang source
En TweetDeck 103.500000
Twitter Web Client 122.333333
Twitter for iPhone 111.922292
Vine - Make a Scene 101.142857
Eu Twitter for iPhone 51.000000
Nl Twitter for iPhone 97.500000
Name: text_size, dtype: float64
We chose a visualization of dog ratings according to its count as favorite, knowing that the thickness of a point corresponds to the text size and the color, to the images number.
fig = px.scatter(twitter_archive_master,
x='favorite_count',
y='rating_numerator',
size='text_size',
hover_name='name',
color = 'img_num',
log_x=True,
size_max=10)
fig.show()